Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets

نویسنده

  • Darko Butina
چکیده

One of the most commonly used clustering algorithms within the worldwide pharmaceutical industry is Jarvis-Patrick’s (J-P) (Jarvis, R. A. IEEE Trans. Comput. 1973, C-22, 1025-1034). The implementation of J-P under Daylight software, using Daylight’s fingerprints and the Tanimoto similarity index, can deal with sets of 100 k molecules in a matter of a few hours. However, the J-P clustering algorithm has several associated problems which make it difficult to cluster large data sets in a consistent and timely manner. The clusters produced are greatly dependent on the choice of the two parameters needed to run J-P clustering, such that this method tends to produce clusters which are either very large and heterogeneous or homogeneous but too small. In any case, J-P always requires time-consuming manual tuning. This paper describes an algorithm which will identify dense clusters where similarity within each cluster reflects the Tanimoto value used for the clustering, and, more importantly, where the cluster centroid will be at least similar, at the given Tanimoto value, to every other molecule within the cluster in a consistent and automated manner. The similarity term used throughout this paper reflects the oVerall similarity between two given molecules, as defined by Daylight’s fingerprints and the Tanimoto similarity index.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Hybrid Time Series Clustering Method Based on Fuzzy C-Means Algorithm: An Agreement Based Clustering Approach

In recent years, the advancement of information gathering technologies such as GPS and GSM networks have led to huge complex datasets such as time series and trajectories. As a result it is essential to use appropriate methods to analyze the produced large raw datasets. Extracting useful information from large data sets has always been one of the most important challenges in different sciences,...

متن کامل

Cluster-Based Image Segmentation Using Fuzzy Markov Random Field

Image segmentation is an important task in image processing and computer vision which attract many researchers attention. There are a couple of information sets pixels in an image: statistical and structural information which refer to the feature value of pixel data and local correlation of pixel data, respectively. Markov random field (MRF) is a tool for modeling statistical and structural inf...

متن کامل

Clustering of Fuzzy Data Sets Based on Particle Swarm Optimization With Fuzzy Cluster Centers

In current study, a particle swarm clustering method is suggested for clustering triangular fuzzy data. This clustering method can find fuzzy cluster centers in the proposed method, where fuzzy cluster centers contain more points from the corresponding cluster, the higher clustering accuracy. Also, triangular fuzzy numbers are utilized to demonstrate uncertain data. To compare triangular fuzzy ...

متن کامل

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

متن کامل

Similarity-based data mining in files of two-dimensional chemical structures using fingerprint measures of molecular resemblance

This paper reviews the use of measures of inter-molecular similarity for processing databases of chemical structures, which play an important role in the discovery of new drugs by the pharmaceutical industry. The similarity measures considered here are based on the use of a fingerprint representation of molecular structure, where a fingerprint is a vector encoding the presence of fragment subst...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Journal of Chemical Information and Computer Sciences

دوره 39  شماره 

صفحات  -

تاریخ انتشار 1999